- Feed-forward neural networks
- Recurrent neural networks
- SRN
- LSTM
- Bi-LSTM
- GRU
A machine learning subfield concerned with learning representations of data. Exceptionally effective at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers.
If you provide the system with large amounts of data, it begins to understand it and respond in useful ways.
\[h = \sigma(W_1x + b_1)\] \[y = \sigma(W_2h + b_2)\]
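The two equations above can be sketched as a forward pass in NumPy; the layer sizes and random weights here are illustrative assumptions, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

x = np.array([0.5, -1.0, 2.0])
h = sigmoid(W1 @ x + b1)   # hidden layer: h = sigma(W1 x + b1)
y = sigmoid(W2 @ h + b2)   # output layer: y = sigma(W2 h + b2)
```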
Optimize (min. or max.)
objective/cost function \(J\)\((\theta)\)
Generate
error signal that measures difference between predictions and target values
Use error signal to change the
weights and get more accurate predictions
Subtracting a fraction of the gradient moves you towards the (local) minimum of the cost function
Define objective to minimize error: \[E(W) = \sum_{d \in D} \sum_{k \in K} (t_{kd} - o_{kd})^2\]
where \(D\) is the set of training examples, \(K\) is the set of output units, and \(t_{kd}\) and \(o_{kd}\) are, respectively, the teacher and current output for unit \(k\) on example \(d\).
Learning rule to change weights so as to minimize error: \[\Delta w_{ji} = - \eta \frac{\partial{E}}{\partial{w_{ji}}}\]
Each weight changed by: \[\Delta w_{ji} = \eta \delta_j o_i \] \[\delta_j = o_j(1 - o_j)(t_j - o_j) \ \ \ \text{if } j \text{ is an output unit}\] \[\delta_j = o_j(1 - o_j)\sum_k{\delta_k w_{kj}} \ \ \ \text{if } j \text{ is a hidden unit}\]
where \(\eta\) is a constant called the learning rate
\(t_j\) is the correct teacher output for unit \(j\)
\(\delta_j\) is the error measure for unit \(j\)
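The update rules above can be sketched for a tiny one-hidden-layer network; the network sizes, the value of \(\eta\), and the single training example are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = 0.1                       # learning rate (assumed value)
rng = np.random.default_rng(1)
W_hid = rng.normal(size=(3, 2)) # weights into 3 hidden units from 2 inputs
W_out = rng.normal(size=(1, 3)) # weights into 1 output unit

x = np.array([1.0, 0.0])        # one hypothetical training example
t = np.array([1.0])             # teacher output t_j

# Forward pass.
o_hid = sigmoid(W_hid @ x)
o_out = sigmoid(W_out @ o_hid)
error_before = float(np.sum((t - o_out) ** 2))

# Error terms delta_j from the rules above.
delta_out = o_out * (1 - o_out) * (t - o_out)            # output units
delta_hid = o_hid * (1 - o_hid) * (W_out.T @ delta_out)  # hidden units

# Weight changes: Delta w_ji = eta * delta_j * o_i (outer product).
W_out += eta * np.outer(delta_out, o_hid)
W_hid += eta * np.outer(delta_hid, x)

# Error on the same example after one update step.
error_after = float(np.sum((t - sigmoid(W_out @ sigmoid(W_hid @ x))) ** 2))
```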
objective/cost function \(J\)\((\theta)\)
Update each element of \(\theta\):
\[\theta^{new}_j = \theta^{old}_j - \alpha \frac{\partial}{\partial \theta^{old}_j} J(\theta)\]
Matrix notation for all parameters ( \(\alpha\): learning rate):
\[\theta^{new} = \theta^{old} - \alpha \nabla_{\theta}J(\theta)\]
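The update rule can be sketched on a toy cost function; the quadratic \(J\), its known minimum, and the number of iterations are illustrative assumptions:

```python
import numpy as np

# A hypothetical quadratic cost J(theta) = ||theta - theta_star||^2
# whose minimum theta_star is known, so convergence is easy to check.
theta_star = np.array([1.0, -2.0])

def grad_J(theta):
    return 2 * (theta - theta_star)   # gradient of J at theta

alpha = 0.1                           # learning rate
theta = np.zeros(2)
for _ in range(100):
    theta = theta - alpha * grad_J(theta)  # theta <- theta - alpha * grad J
```

Each step subtracts a fraction of the gradient, moving \(\theta\) toward the minimum of \(J\).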
Recursively apply the chain rule through each node
A learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data)
Suppose we had the following scenario:
Day 1: Lift Weights
Day 2: Swimming
Day 3: At this point, our model must decide whether we should take a rest day or do yoga. Unfortunately, it only has access to the previous day: it knows we swam yesterday, but it doesn't know whether we had taken a break the day before. Therefore, it can end up predicting yoga.
\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]
\[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\] \[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]
\[C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\]
\[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)\] \[h_t = o_t * \tanh(C_t)\]
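The LSTM equations above can be sketched as one step function in NumPy; the hidden/input sizes, random weights, zero biases, and toy input sequence are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_h, n_x = 4, 3                 # hypothetical hidden and input sizes
rng = np.random.default_rng(2)
# One weight matrix per gate, each acting on the concatenation [h_{t-1}, x_t].
W_f, W_i, W_C, W_o = (rng.normal(size=(n_h, n_h + n_x)) for _ in range(4))
b_f = b_i = b_C = b_o = np.zeros(n_h)

def lstm_step(h_prev, C_prev, x_t):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # new hidden state
    return h_t, C_t

h, C = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(5, n_x)):        # run over a toy sequence
    h, C = lstm_step(h, C, x_t)
```

Note that `*` here is elementwise multiplication, matching the \(*\) in the equations.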
POS Tagging
https://www.aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t])\] \[r_t = \sigma(W_r \cdot [h_{t-1}, x_t])\] \[\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])\] \[h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t\]
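The GRU equations can likewise be sketched as a step function; sizes, random weights, and the toy input sequence are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_h, n_x = 4, 3                          # hypothetical hidden and input sizes
rng = np.random.default_rng(3)
W_z = rng.normal(size=(n_h, n_h + n_x))  # update gate weights
W_r = rng.normal(size=(n_h, n_h + n_x))  # reset gate weights
W   = rng.normal(size=(n_h, n_h + n_x))  # candidate weights

def gru_step(h_prev, x_t):
    z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]))          # update gate
    r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]))          # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # candidate
    return (1 - z_t) * h_prev + z_t * h_tilde                   # new state

h = np.zeros(n_h)
for x_t in rng.normal(size=(5, n_x)):    # run over a toy sequence
    h = gru_step(h, x_t)
```

Compared to the LSTM, the GRU merges cell and hidden state and uses the update gate \(z_t\) to interpolate between the old state and the candidate.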